Abstract

This paper describes a parsing model that combines the exact dynamic
programming of CRF parsing with the rich nonlinear featurization of
neural net approaches. Our model is structurally a CRF that factors
over anchored rule productions, but instead of linear potential
functions based on sparse features, we use nonlinear potentials
computed via a feedforward neural network. Because potentials are
still local to anchored rules, structured inference (CKY) is unchanged
from the sparse case. Computing gradients during learning involves
backpropagating an error signal formed from standard CRF sufficient
statistics (expected rule counts). Using only dense features, our
neural CRF already exceeds a strong baseline CRF model (Hall et al.,
2014). In combination with sparse features, our system^1 achieves 91.1
F1 on section 23 of the Penn Treebank, and more generally outperforms
the best prior single parser results on a range of languages.

1 Introduction 

Neural network-based approaches to structured NLP tasks have both
strengths and weaknesses when compared to more conventional models,
such as conditional random fields (CRFs). A key strength of neural
approaches is their ability to learn nonlinear interactions between
underlying features. In the case of unstructured output spaces, this
capability has led to gains in problems ranging from syntax (Chen and
Manning, 2014; Belinkov et al., 2014) to lexical semantics
(Kalchbrenner et al., 2014; Kim, 2014). Neural methods are also
powerful tools in the case of structured

^1 System available at http://nlp.cs.berkeley.edu


output spaces. Here, past work has often relied on recurrent
architectures (Henderson, 2003; Socher et al., 2013; Irsoy and Cardie,
2014), which can propagate information through structure via
real-valued hidden state, but as a result do not admit efficient
dynamic programming (Socher et al., 2013; Le and Zuidema, 2014).
However, there is a natural marriage of nonlinear induced features and
efficient structured inference, as explored by Collobert et al. (2011)
for the case of sequence modeling: feedforward neural networks can be
used to score local decisions which are then “reconciled” in a
discrete structured modeling framework, allowing inference via dynamic
programming.

In this work, we present a CRF constituency parser based on these
principles, where individual anchored rule productions are scored
based on nonlinear features computed with a feedforward neural
network. A separate, identically-parameterized replicate of the
network exists for each possible span and split point. As input, it
takes vector representations of words at the split point and span
boundaries; it then outputs scores for anchored rules applied to that
span and split point. These scores can be thought of as nonlinear
potentials analogous to linear potentials in conventional CRFs.
Crucially, while the network replicates are connected in a unified
model, their computations factor along the same substructures as in
standard CRFs.

Prior work on parsing using neural network models has often
sidestepped the problem of structured inference by making sequential
decisions (Henderson, 2003; Chen and Manning, 2014; Tsuboi, 2014) or
by doing reranking (Socher et al., 2013; Le and Zuidema, 2014); by
contrast, our framework permits exact inference via CKY, since the
model’s structured interactions are purely discrete and do not involve
continuous hidden state. Therefore, we can exploit a neural net’s
capacity to learn nonlinear features without modifying
Figure 1: Neural CRF model. On the right, each anchored rule $(r, s)$
in the tree is independently scored by a function $\phi$, so we can
perform inference with CKY to compute marginals or the Viterbi tree.
On the left, we show the process for scoring an anchored rule with
neural features: words in $f_w$ (see Figure 2) are embedded, then fed
through a neural network with one hidden layer to compute dense
intermediate features, whose conjunctions with sparse rule indicator
features $f_o$ are scored according to parameters $W$.

our core inference mechanism, allowing us to use tricks like coarse
pruning that make inference efficient in the purely sparse model. Our
model can be trained by gradient descent exactly as in a conventional
CRF, with the gradient of the network parameters naturally computed by
backpropagating a difference of expected anchored rule counts through
the network for each span and split point.

Using dense learned features alone, the neural CRF model obtains high
performance, outperforming the CRF parser of Hall et al. (2014). When
sparse indicators are used in addition, the resulting model gets 91.1
F1 on section 23 of the Penn Treebank, outperforming the parser of
Socher et al. (2013) as well as the Berkeley Parser (Petrov and Klein,
2007) and matching the discriminative parser of Carreras et al.
(2008). The model also obtains the best single parser results on nine
other languages, again outperforming the system of Hall et al. (2014).

2 Model 

Figure 1 shows our neural CRF model. The model decomposes over
anchored rules, and it scores each of these with a potential function;
in a standard CRF, these potentials are typically linear functions of
sparse indicator features, whereas


Figure 2: Example of an anchored rule production for the rule
NP → NP PP. From the anchoring $s = (i, j, k)$, we extract either
sparse surface features $f_s$ or a sequence of word indicators $f_w$
which are embedded to form a vector representation $v(f_w)$ of the
anchoring’s lexical properties.

in our approach they are nonlinear functions of word embeddings.^2
Section 2.1 describes our notation for anchored rules, and Section 2.2
talks about how they are scored. We then discuss specific choices of
our featurization (Section 2.3) and the backbone grammar used for
structured inference (Section 2.4).

2.1 Anchored Rules 

The fundamental units that our parsing models consider are anchored
rules. As shown in Figure 2, we define an anchored rule as a tuple
$(r, s)$, where $r$ is an indicator of the rule’s identity and
$s = (i, j, k)$ indicates the span $(i, k)$ and split point $j$ of the
rule.^3 A tree $T$ is simply a collection of anchored rules subject to
the constraint that those rules form a tree. All of our parsing models
are CRFs that decompose over anchored rule productions and place a
probability distribution over trees conditioned on a sentence $w$ as
follows:

$$P(T \mid w) \propto \exp\left( \sum_{(r,s) \in T} \phi(w, r, s) \right)$$

^2 Throughout this work, we will primarily consider two potential
functions: linear functions of sparse indicators and nonlinear neural
networks over dense, continuous features. Although other modeling
choices are possible, these two points in the design space reflect
common choices in NLP, and past work has suggested that nonlinear
functions of indicators or linear functions of dense features may
perform less well (Wang and Manning, 2013).

^3 For simplicity of exposition, we ignore unary rules; however, they
are easily supported in this framework by simply specifying a null
value for the split point.

where $\phi$ is a scoring function that considers the input sentence
and the anchored rule in question. Figure 1 shows this scoring process
schematically. As we will see, the module on the left can be a neural
net, a linear function of surface features, or a combination of the
two, as long as it provides anchored rule scores, and the structured
inference component is the same regardless (CKY).
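To make the distribution over trees concrete, the following minimal sketch normalizes exponentiated scores over an explicitly enumerated set of candidate trees. The enumeration is purely illustrative (the models in this paper obtain the same quantities with CKY rather than by listing trees), and the names `tree_distribution` and `phi` are hypothetical.

```python
import math


def tree_distribution(trees, phi):
    """Return P(T | w) for each candidate tree by brute-force normalization.

    trees: list of trees, each a list of anchored rules (r, (i, j, k)).
    phi:   function mapping an anchored rule to its real-valued score.
    Only feasible for tiny candidate sets; shown for illustration.
    """
    # Score of a tree = sum of the scores of its anchored rules.
    totals = [sum(phi(rule) for rule in tree) for tree in trees]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(totals)
    unnorm = [math.exp(t - m) for t in totals]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Because each tree score is a sum of local anchored rule scores, any scorer that maps an anchored rule to a real number can be plugged in as `phi` without changing the normalization.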

A PCFG estimated with maximum likelihood has
$\phi(w, r, s) = \log P(r \mid \text{parent}(r))$, which is
independent of the anchoring $s$ and the words $w$ except for
preterminal productions; a basic discriminative parser might let this
be a learned parameter but still disregard the surface information.
However, surface features can capture useful syntactic cues (Finkel et
al., 2008; Hall et al., 2014). Consider the example in Figure 2: the
proposed parent NP is preceded by the word "reflected" and followed by
a period, which is a surface context characteristic of NPs or PPs in
object position. Beginning with "the" and ending with "personality"
are typical properties of NPs as well, and the choice of the
particular rule NP → NP PP is supported by the fact that the proposed
child PP begins with "of". This information can be captured with
sparse features ($f_s$ in Figure 2) or, as we describe below, with a
neural network taking lexical context as input.

2.2 Scoring Anchored Rules 

Following Hall et al. (2014), our baseline sparse scoring function
takes the following bilinear form:

$$\phi_{\text{sparse}}(w, r, s; W) = f_s(w, s)^\top W f_o(r)$$

where $f_o(r) \in \{0,1\}^{n_o}$ is a sparse vector of features
expressing properties of $r$ (such as the rule’s identity or its
parent label) and $f_s(w, s) \in \{0,1\}^{n_s}$ is a sparse vector of
surface features associated with the words in the sentence and the
anchoring, as shown in Figure 2. $W$ is an $n_s \times n_o$ matrix of
weights.^4 The scoring of a particular anchored rule is depicted in
Figure 3a; note that surface features and rule indicators are
conjoined in a systematic way.
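As a minimal sketch of the bilinear form, assume both indicator vectors are represented by the sets of their active features and the weight matrix by a dictionary keyed on (surface feature, rule feature) pairs. This representation and the name `sparse_score` are assumptions made for illustration, not the paper's implementation.

```python
def sparse_score(active_surface, active_rule, weights):
    """Compute f_s^T W f_o when f_s and f_o are 0/1 indicator vectors.

    active_surface: iterable of surface features that fire in f_s
    active_rule:    iterable of rule features that fire in f_o
    weights:        dict mapping (surface_feature, rule_feature) -> weight;
                    missing pairs are treated as weight 0.0
    """
    # With indicator vectors, the bilinear product reduces to summing the
    # weights of every (surface feature, rule feature) conjunction.
    return sum(
        weights.get((fs, fo), 0.0)
        for fs in active_surface
        for fo in active_rule
    )
```

The double loop makes the systematic conjunction explicit: every active surface feature is paired with every active rule feature.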

^4 A more conventional expression of the scoring function for a CRF is
$\phi(w, r, s) = \theta^\top f(w, r, s)$, with a vector $\theta$ for
the parameters and a single feature extractor $f$ that jointly
inspects the surface and the rule. However, when the feature
representation conjoins each rule $r$ with surface properties of the
sentence in a systematic way (an assumption that holds in our case as
well as for standard CRF models for POS tagging and NER), this is
equivalent to our formalism.

a) $\phi = f_s^\top W f_o$    b) $\phi = g(H v(f_w))^\top W f_o$

Figure 3: Our sparse (left) and neural (right) scoring functions for
CRF parsing. $f_s$ and $f_w$ are raw surface feature vectors for the
sparse and neural models (respectively) extracted over anchored spans
with split points. (a) In the sparse case, we multiply $f_s$ by a
weight matrix $W$ and then a sparse output vector $f_o$ to score the
rule production. (b) In the neural case, we first embed $f_w$ and then
transform it with a one-layer neural network in order to produce an
intermediate feature representation $h$ before combining with $W$ and
$f_o$.

The role of $f_s$ can be equally well played by a vector of dense
features learned via a neural network. We will now describe how to
compute these features, which represent a transformation of surface
lexical indicators $f_w$. Define $f_w(w, s) \in \mathbb{N}^{n_w}$ to
be a function that produces a fixed-length sequence of word indicators
based on the input sentence and the anchoring. This vector of word
identities is then passed to an embedding function
$v : \mathbb{N} \to \mathbb{R}^{n_e}$ and the dense representations of
the words are subsequently concatenated to form a vector we denote by
$v(f_w)$.^5 Finally, we multiply this by a matrix
$H \in \mathbb{R}^{n_h \times (n_w n_e)}$ of real-valued parameters
and pass it through an elementwise nonlinearity $g(\cdot)$. We use
rectified linear units $g(x) = \max(x, 0)$ and discuss this choice
more in Section 6.

Replacing $f_s$ with the end result of this computation
$h(w, s; H) = g(H v(f_w(w, s)))$, our scoring function becomes

$$\phi_{\text{neural}}(w, r, s; H, W) = h(w, s; H)^\top W f_o(r)$$

as shown in Figure 3b. For a fixed $H$, this model can be viewed as a
basic CRF with dense input features. By learning $H$, we learn
intermediate feature representations that provide the model with
^5 Embedding words allows us to use standard pre-trained vectors more
easily and tying embeddings across word positions substantially
reduces the number of model parameters. However, embedding features
rather than words has also been shown to be effective (Chen et al.,
2014).

more discriminating power. Also note that it is possible to use deeper
networks or more sophisticated architectures here; we will return to
this in Section 6.
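The neural scoring function can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: matrices are lists of rows, embeddings are looked up in a dictionary, $f_o$ is given by the indices of its active entries, and all function names are hypothetical.

```python
def relu(vector):
    """Elementwise rectified linear unit g(x) = max(x, 0)."""
    return [max(x, 0.0) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def hidden_layer(word_ids, embeddings, H):
    """h(w, s; H) = g(H v(f_w)): embed, concatenate, transform, rectify."""
    v = [component for wid in word_ids for component in embeddings[wid]]
    return relu(matvec(H, v))


def neural_score(word_ids, active_rule, embeddings, H, W):
    """phi_neural = h^T W f_o, with f_o given by its active indices.

    W has one row per hidden unit and one column per rule feature.
    """
    h = hidden_layer(word_ids, embeddings, H)
    return sum(h[k] * W[k][j] for k in range(len(h)) for j in active_rule)
```

Note that `hidden_layer` depends only on the words and the anchoring, so its output can be shared by every rule applied to the same span and split point.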

Our two models can be easily combined:

$$\phi(w, r, s; W_1, H, W_2) = \phi_{\text{sparse}}(w, r, s; W_1) + \phi_{\text{neural}}(w, r, s; H, W_2)$$

Weights for each component of the scoring function can be learned
fully jointly and inference proceeds as before.

2.3 Features 

We take $f_s$ to be the set of features described in Hall et al.
(2014). At the preterminal layer, the model considers prefixes and
suffixes up to length 5 of the current word and neighboring words, as
well as the words’ identities. For nonterminal productions, we fire
indicators on the words^6 before and after the start, end, and split
point of the anchored rule (as shown in Figure 2) as well as on two
other span properties, span length and span shape (an indicator of
where capitalized words, numbers, and punctuation occur in the span).

For our neural model, we take $f_w$ for all productions (preterminal
and nonterminal) to be the words surrounding the beginning and end of
a span and the split point, as shown in Figure 2; in particular, we
look two words in either direction around each point of interest,
meaning the neural net takes 12 words as input.^7 For our word
embeddings $v$, we use pre-trained word vectors from Bansal et al.
(2014). We compare with other sources of word vectors in Section 5.
Contrary to standard practice, we do not update these vectors during
training; we found that doing so did not provide an accuracy benefit
and slowed down training considerably.
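The 12-word input window can be sketched as follows. The indexing convention is an assumption made for illustration: $i$, $j$, and $k$ are treated as fencepost positions, the two words on each side of a fencepost are taken, and positions outside the sentence are filled with a hypothetical padding token. The paper does not specify these details, and `context_words` is a hypothetical name.

```python
def context_words(words, i, j, k, width=2, pad="<PAD>"):
    """Collect `width` words on each side of the fenceposts i, j, and k.

    Assumes fencepost indexing: fencepost p sits between words[p - 1]
    and words[p]. Positions outside the sentence yield the pad token.
    """
    window = []
    for p in (i, j, k):
        for q in range(p - width, p + width):
            window.append(words[q] if 0 <= q < len(words) else pad)
    return window
```

With `width=2` and three points of interest, the function always returns 3 × 4 = 12 entries, so the network input has a fixed length regardless of the span size.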

2.4 Grammar Refinements 

A recurring issue in discriminative constituency parsing is the
granularity of annotation in the base grammar (Finkel et al., 2008;
Petrov and Klein, 2008; Hall et al., 2014). Using finer-grained
symbols in our rules $r$ gives the model greater capacity, but also
introduces more parameters into $W$ and

^6 The model actually uses the longest suffix of each word occurring
at least 100 times in the training set, up to the entire word.
Removing this abstraction of rare words harms performance.

^7 The sparse model did not benefit from using this larger
neighborhood, so improvements from the neural net are not simply due
to considering more lexical context.


increases the ability to overfit. Following Hall et al. (2014), we use
grammars with very little annotation: we use no horizontal
Markovization for any of our experiments, and all of our English
experiments with the neural CRF use no vertical Markovization
($V = 0$). This also has the benefit of making the system much faster,
due to the smaller state space for dynamic programming. We do find
that using parent annotation ($V = 1$) is useful on other languages
(see Section 7.2), but this is the only grammar refinement we
consider.


3 Learning 

To learn weights for our neural model, we maximize the conditional log
likelihood of our $D$ training trees $T^*$:

$$\mathcal{L}(H, W) = \sum_{i=1}^{D} \log P(T_i^* \mid w_i; H, W)$$

Because we are using rectified linear units as our nonlinearity, our
objective is not everywhere differentiable. The interaction of the
parameters and the nonlinearity also makes the objective non-convex.
However, in spite of this, we can still follow subgradients to
optimize this objective, as is standard practice.

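Recall that $h(w, s; H)$ are the hidden layer activations. The
gradient of $W$ takes the standard form of log-linear models:

$$\frac{\partial \mathcal{L}}{\partial W} = \left( \sum_{(r,s) \in T^*} h(w, s; H) f_o(r)^\top \right) - \left( \sum_{T} P(T \mid w; H, W) \sum_{(r,s) \in T} h(w, s; H) f_o(r)^\top \right)$$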

Note that the outer products give matrices of feature counts
isomorphic to $W$. The second expression can be simplified to be in
terms of expected feature counts. To update $H$, we use standard
backpropagation by first computing:

$$\frac{\partial \mathcal{L}}{\partial h} = \left( \sum_{(r,s) \in T^*} W f_o(r) \right) - \left( \sum_{T} P(T \mid w; H, W) \sum_{(r,s) \in T} W f_o(r) \right)$$


Since h is the output of the neural network, we can 
then apply the chain rule to compute gradients for 
H and any other parameters in the neural network. 
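The gradient flow for a single anchoring can be sketched as below. This is an illustrative reading of the update, assuming the difference between gold and expected rule feature counts at that anchoring has already been computed (for example from CKY marginals); the function name and the list-of-rows matrix layout are hypothetical.

```python
def backprop_anchoring(h, v, W, count_diff):
    """Gradients of the log likelihood for one span and split point.

    h:          hidden activations g(H v) at this anchoring (length n_h)
    v:          concatenated word embeddings fed to the network
    W:          output weights, n_h rows by n_o columns
    count_diff: gold minus expected count for each rule feature (length n_o)
    Returns (grad_W, grad_H), shaped like W and H respectively.
    """
    n_h, n_o = len(W), len(W[0])
    # Outer product of activations and the count difference.
    grad_W = [[h[k] * count_diff[j] for j in range(n_o)] for k in range(n_h)]
    # Error signal arriving at the hidden layer: W times the count difference.
    grad_h = [sum(W[k][j] * count_diff[j] for j in range(n_o)) for k in range(n_h)]
    # ReLU subgradient: the signal passes only through active units.
    grad_pre = [g if hk > 0 else 0.0 for g, hk in zip(grad_h, h)]
    # Chain rule through the linear map H v.
    grad_H = [[gp * vi for vi in v] for gp in grad_pre]
    return grad_W, grad_H
```

Summing these per-anchoring gradients over all spans and split points of a sentence gives the full gradient for that training example.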



Learning uses Adadelta (Zeiler, 2012), which has been employed in past
work (Kim, 2014). We found that Adagrad (Duchi et al., 2011) performed
equally well with tuned regularization and step size parameters, but
Adadelta worked better out of the box. We set the momentum term
$\rho = 0.95$ (as suggested by Zeiler (2012)) and did not regularize
the weights at all. We used a minibatch size of 200 trees, although
the system was not particularly sensitive to this. For each treebank,
we trained for either 10 passes through the treebank or 1000
minibatches, whichever is shorter.
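For reference, one Adadelta update in the form given by Zeiler (2012) can be sketched as follows. The sketch is written for minimization, so it would be applied to the negative log likelihood; the value of `eps` is an assumption, since the paper reports only $\rho$.

```python
import math


def adadelta_step(params, grads, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """Apply one in-place Adadelta update to a flat list of parameters.

    acc_grad and acc_delta hold running averages of squared gradients and
    squared updates; both start as lists of zeros. eps is an assumed value.
    """
    for idx, g in enumerate(grads):
        # Decaying average of squared gradients.
        acc_grad[idx] = rho * acc_grad[idx] + (1.0 - rho) * g * g
        # Step scaled by the ratio of the RMS of past updates to RMS of gradients.
        delta = -math.sqrt(acc_delta[idx] + eps) / math.sqrt(acc_grad[idx] + eps) * g
        # Decaying average of squared updates.
        acc_delta[idx] = rho * acc_delta[idx] + (1.0 - rho) * delta * delta
        params[idx] += delta
    return params
```

No global learning rate appears in the update, which is one plausible reason the method needs little tuning.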

We initialized the output weight matrix $W$ to zero. To break
symmetry, the lower level neural network parameters $H$ were
initialized with each entry being independently sampled from a
Gaussian with mean 0 and variance 0.01; Gaussian performed better than
uniform initialization, but the variance was not important.
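A minimal sketch of this initialization scheme follows. Note that a variance of 0.01 corresponds to a standard deviation of 0.1, which is the quantity `random.gauss` expects; the function name, argument names, and seed are hypothetical.

```python
import random


def init_parameters(n_h, n_in, n_o, variance=0.01, seed=0):
    """Return (H, W): H is Gaussian-initialized, W is all zeros.

    n_h:  number of hidden units
    n_in: input dimension (number of words times embedding size)
    n_o:  number of rule indicator features
    """
    rng = random.Random(seed)
    std = variance ** 0.5  # variance 0.01 -> standard deviation 0.1
    H = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_h)]
    W = [[0.0] * n_o for _ in range(n_h)]
    return H, W
```

With $W$ at zero, every rule initially receives the same neural score, while the random $H$ ensures that hidden units receive different gradients once $W$ starts to move.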

4 Inference 

Our baseline and neural model both score anchored rule productions. We
can use CKY in the standard fashion to compute either expected
anchored rule counts $E_{P(T \mid w)}[(r, s)]$ or the Viterbi tree
$\arg\max_T P(T \mid w)$.
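The Viterbi case can be sketched as a textbook CKY recursion over anchored binary rules. This is an illustrative implementation, not the authors' code: unary rules are ignored (as in the exposition above), no pruning is applied, and the two scoring callbacks are hypothetical stand-ins for the potential function.

```python
import math


def viterbi_cky(n, binary_rules, leaf_score, rule_score):
    """Viterbi CKY over anchored binary rules.

    n:            sentence length
    binary_rules: list of (parent, left, right) label triples
    leaf_score:   leaf_score(label, i) -> score of label over word i, or None
    rule_score:   rule_score(rule, (i, j, k)) -> score of the anchored rule
    Returns (best, back): best[(i, k, A)] is the best total score of a
    subtree rooted at A over span (i, k); back stores the argmax split.
    """
    best, back = {}, {}
    labels = {label for rule in binary_rules for label in rule}
    # Width-1 spans.
    for i in range(n):
        for label in labels:
            s = leaf_score(label, i)
            if s is not None:
                best[(i, i + 1, label)] = s
    # Wider spans, built bottom-up from smaller ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            k = i + length
            for rule in binary_rules:
                parent, left, right = rule
                for j in range(i + 1, k):
                    if (i, j, left) not in best or (j, k, right) not in best:
                        continue
                    s = (rule_score(rule, (i, j, k))
                         + best[(i, j, left)] + best[(j, k, right)])
                    if s > best.get((i, k, parent), -math.inf):
                        best[(i, k, parent)] = s
                        back[(i, k, parent)] = (j, left, right)
    return best, back
```

Replacing the maximization with a log-sum-exp over the same candidates would yield inside scores, from which expected anchored rule counts can be derived together with an outside pass.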

We speed up inference by using a coarse pruning pass. We follow Hall
et al. (2014) and prune according to an X-bar grammar with
head-outward binarization, ruling out any constituent whose max
marginal probability is less than $e^{-9}$. With this pruning, the
number of spans and split points to be considered is greatly reduced;
however, we still need to compute the neural network activations for
each remaining span and split point, of which there may be thousands
for a given sentence.^8 We can improve efficiency further by noting
that the same word will appear in the same position in a large number
of span/split point combinations, and cache the contribution to the
hidden layer caused by that word (Chen and Manning, 2014). Computing
the hidden layer then simply requires adding $n_w$ vectors together
and applying the nonlinearity, instead of a more costly matrix
multiply.
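The caching idea can be sketched as follows: the product of $H$ with the concatenated embeddings decomposes into one contribution per input slot, so each (slot, word) contribution is computed once and reused. The class name and data layout are assumptions made for illustration.

```python
class HiddenLayerCache:
    """Cache per-(slot, word) contributions to the pre-activation H v(f_w)."""

    def __init__(self, H, embeddings):
        self.H = H                    # list of rows, each of length n_w * n_e
        self.embeddings = embeddings  # dict: word id -> embedding (length n_e)
        self.cache = {}

    def contribution(self, slot, word_id):
        """Block of H for this input slot, multiplied by the word's embedding."""
        key = (slot, word_id)
        if key not in self.cache:
            vec = self.embeddings[word_id]
            offset = slot * len(vec)
            self.cache[key] = [
                sum(row[offset + d] * vec[d] for d in range(len(vec)))
                for row in self.H
            ]
        return self.cache[key]

    def hidden(self, word_ids):
        """Sum the cached contributions, then apply the ReLU nonlinearity."""
        total = [0.0] * len(self.H)
        for slot, word_id in enumerate(word_ids):
            for unit, value in enumerate(self.contribution(slot, word_id)):
                total[unit] += value
        return [max(value, 0.0) for value in total]
```

After the cache is warm, each call to `hidden` costs one vector addition per input word, independent of the embedding dimension.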

Because the number of rule indicators $n_o$ is fairly large
(approximately 4000 in the Penn Treebank), the multiplication by $W$
in the model is also

^8 One reason we did not choose to include the rule identity $f_o$ as
an input to the network is that it requires computing an even larger
number of network activations, since we cannot reuse them across rules
over the same span and split point.


expensive. However, because only a small number of rules can apply to
a given span and split point, $f_o$ is sparse and we can selectively
compute the terms necessary for the final bilinear product.

Our combined sparse and neural model trains on the Penn Treebank in 24
hours on a single machine with a parallelized CPU implementation. For
reference, the purely sparse model with a parent-annotated grammar
(necessary for the best results) takes around 15 hours on the same
machine.

5 System Ablations 

Table 1 shows results on section 22 (the development set) of the
English Penn Treebank (Marcus et al., 1993), computed using evalb.
Full test results and comparisons to other systems are shown in Table
4. We compare variants of our system along two axes: whether they use
standard linear sparse features, nonlinear dense features from the
neural net, or both, and whether any word representations (vectors or
clusters) are used.

Sparse vs. neural The neural CRF (line (d) in Table 1) on its own
outperforms the sparse CRF (a, b) even when the sparse CRF has a more
heavily annotated grammar. This is a surprising result: the features
in the sparse CRF have been carefully engineered to capture a range of
linguistic phenomena (Hall et al., 2014), and there is no guarantee
that word vectors will capture the same. For example, at the POS
tagging layer, the sparse model looks at prefixes and suffixes of
words, which give the model access to morphology for predicting tags
of unknown words, which typically have regular inflection patterns. By
contrast, the neural model must rely on the geometry of the vector
space exposing useful regularities. At the same time, the strong
performance of the combination of the two systems (g) indicates that
not only are both featurization approaches high-performing on their
own, but that they have complementary strengths.

Unlabeled data Much attention has been paid to the choice of word
vectors for various NLP tasks, notably whether they capture more
syntactic or semantic phenomena (Bansal et al., 2014; Levy and
Goldberg, 2014). We primarily use vectors from Bansal et al. (2014),
who train the skip-gram model of Mikolov et al. (2013) using contexts
from dependency links; a similar approach was also suggested by Levy
and Goldberg (2014).
Table 1: Results of our sparse CRF, neural CRF, and combined parsing
models on section 22 of the Penn Treebank. Systems are broken down by
whether local potentials come from sparse features and/or the neural
network (the primary contribution of this work), their level of
vertical Markovization, and what kind of word representations they
use. The neural CRF (d) outperforms the sparse CRF (a, b) even when a
more heavily annotated grammar is used, and the combined approach (g)
is substantially better than either individual model. The contribution
of the neural architecture cannot be replaced by Brown clusters (c),
and even word representations learned just on the Penn Treebank are
surprisingly effective (f, h).

However, as these embeddings are trained on a relatively small corpus
(BLLIP minus the Penn Treebank), it is natural to wonder whether
less-syntactic embeddings trained on a larger corpus might be more
useful. This is not the case: line (e) in Table 1 shows the
performance of the neural CRF using the Wikipedia-trained word
embeddings of Collobert et al. (2011), which do not perform better
than the vectors of Bansal et al. (2014).

To isolate the contribution of continuous word representations
themselves, we also experimented with vectors trained on just the text
from the training set of the Penn Treebank using the skip-gram model
with a window size of 1. While these vectors are somewhat lower
performing on their own (f), they still provide a surprising and
noticeable gain when stacked on top of sparse features (h), again
suggesting that dense and sparse representations have complementary
strengths. This result also reinforces the notion that the utility of
word vectors does not come primarily from importing information about
out-of-vocabulary words (Andreas and Klein, 2014).

Since the neural features incorporate information from unlabeled data,
we should provide the
Table 2: Exploration of other implementation choices in the
feedforward neural network on sentences of length < 40 from section 22
of the Penn Treebank. Rectified linear units perform better than tanh
or cubic units, a network with one hidden layer performs best, and
embedding the output feature vector gives worse performance.

sparse model with similar information for a true apples-to-apples
comparison. Brown clusters have been shown to be effective vehicles in
the past (Koo et al., 2008; Turian et al., 2010; Bansal et al., 2014).
We can incorporate Brown clusters into the baseline CRF model in an
analogous way to how embedding features are used in the dense model:
surface features are fired on Brown cluster identities (we use
prefixes of length 4 and 10) of key words. We use the Brown clusters
from Koo et al. (2008), which are trained on the same data as the
vectors of Bansal et al. (2014). However, Table 1 shows that these
features provide no benefit to the baseline model, which suggests
either that it is difficult to learn reliable weights for these as
sparse features or that different regularities are being captured by
the word embeddings.

6 Design Choices 

The neural net design space is large, so we wish to analyze the
particular design choices we made for this system by examining the
performance of several variants of the neural net architecture used in
our system. Table 2 shows development results from potential alternate
architectural choices, which we now discuss.

Choice of nonlinearity The choice of nonlinearity $g$ has been
frequently discussed in the neural network literature. Our choice
$g(x) = \max(x, 0)$, a rectified linear unit, is increasingly popular
in computer vision (Krizhevsky et al., 2012). $g(x) = \tanh(x)$ is a
traditional nonlinearity widely used throughout the history of neural
nets (Bengio et al., 2003). $g(x) = x^3$ (cube) was found to be most
successful by Chen and Manning (2014).

Table 2 compares the performance of these three nonlinearities. We see
that rectified linear units perform the best, followed by tanh units,
followed by cubic units.^9 One drawback of tanh as an activation
function is that it is easily “saturated” if the input to the unit is
too far away from zero, causing the backpropagation of derivatives
through that unit to essentially cease; this is known to cause
problems for training, requiring special purpose machinery for use in
deep networks (Ioffe and Szegedy, 2015).
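The saturation effect can be seen directly from the derivatives of the three nonlinearities. The following sketch is illustrative only; the function names are hypothetical.

```python
import math


def relu_grad(x):
    """Subgradient of max(x, 0): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x), which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def cube_grad(x):
    """Derivative of x**3, which grows quadratically with |x|."""
    return 3.0 * x * x
```

At an input of 5, the tanh derivative is below 0.001 while the rectified linear unit still passes the full error signal; the cubic unit instead amplifies it, which is a different potential source of instability.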

Depth Given that we are using rectified linear units, it bears asking
whether or not our implementation is improving substantially over
linear features of the continuous input. We can use the embedding
vector of an anchored span $v(f_w)$ directly as input to a basic
linear CRF, as shown in Figure 4a. Table 2 shows that the purely
linear architecture (0 HL) performs surprisingly well, but is still
less effective than the network with one hidden layer. This agrees
with the results of Wang and Manning (2013), who noted that dense
features typically benefit from nonlinear modeling. We also compare
against a two-layer neural network, but find that this also performs
worse than the one-layer architecture.

Densifying output features Overall, it appears beneficial to use dense
representations of surface features; a natural question that one might
ask is whether the same technique can be applied to the sparse output
feature vector $f_o$. We can apply the approach of Srikumar and
Manning (2014) and multiply the sparse output vector by a dense matrix
$K$, giving the following scoring function (shown in Figure 4b):

$$\phi(w, r, s; H, W, K) = g(H v(f_w(w, s)))^\top W K f_o(r)$$

where $W$ is now $n_h \times n_{oe}$ and $K$ is $n_{oe} \times n_o$.
$WK$ can be seen as a low-rank approximation of the original $W$ at
the output layer, similar to low-rank factorizations of parameter
matrices used in past

^9 The performance of cube decreased substantially late in learning;
it peaked at around 90.52. Dropout may be useful for alleviating this
type of overfitting, but in our experiments we did not find dropout to
be beneficial overall.


a) $\phi = v(f_w)^\top W f_o$    b) $\phi = g(H v(f_w))^\top W K f_o$

Figure 4: Two additional forms of the scoring function. a) Linear
version of the dense model, equivalent to a CRF with continuous-valued
input features. b) Version of the dense model where outputs are also
embedded according to a learned matrix $K$.

work (Lei et al., 2014). This approach saves us from having to learn a
separate row of $W$ for every rule in the grammar; if rules are given
similar embeddings, then they will behave similarly according to the
model.
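The low-rank output layer can be sketched as follows. For simplicity the sketch assumes $f_o$ is one-hot (a single rule index), whereas the paper's $f_o$ may activate several features; the function name and list-of-rows layout are hypothetical.

```python
def low_rank_score(h, W, K, rule_index):
    """Score h^T W K f_o for a one-hot f_o selecting rule_index.

    h: hidden activations (length n_h)
    W: n_h rows by n_oe columns
    K: n_oe rows by n_o columns; column j is the embedding of rule j
    """
    # K f_o: pick out the embedding (column of K) for this rule.
    rule_embedding = [row[rule_index] for row in K]
    # W (K f_o): project the rule embedding into hidden-unit space.
    projected = [
        sum(w * e for w, e in zip(w_row, rule_embedding)) for w_row in W
    ]
    # h^T (W K f_o).
    return sum(hk * pk for hk, pk in zip(h, projected))
```

Two rules with identical columns in `K` necessarily receive identical scores, which makes concrete how shared embeddings tie rule behavior together.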

We experimented with $n_{oe} = 20$ and show the results in Table 2.
Unfortunately, this approach does not seem to work well for parsing.
Learning the output representation was empirically very unstable, and
it also required careful initialization. We tried Gaussian
initialization (as in the rest of our model) and initializing the
model by clustering rules either randomly or according to their parent
symbol. The latter is what is shown in the table, and gave
substantially better performance. We hypothesize that blurring
distinctions between output classes may harm the model’s ability to
differentiate between closely-related symbols, which is required for
good parsing performance. Using pre-trained rule embeddings at this
layer might also improve performance of this method.

7 Test Results 

We evaluate our system under two conditions: 
first, on the English Penn Treebank, and second, 
on the nine languages used in the SPMRL 2013 
and 2014 shared tasks. 

7.1 Penn Treebank 

Table 4 reports results on section 23 of the Penn Treebank (PTB). We
focus our comparison on single parser systems as opposed to rerankers,
ensembles, or self-trained methods (though these are also mentioned
for context). First, we compare against
















                           Arabic  Basque  French  German  Hebrew  Hungarian  Korean  Polish  Swedish  Avg
Dev, all lengths
Hall et al. (2014)         78.89   83.74   79.40   83.28   88.06   87.44      81.85   91.10   75.95    83.30
This work*                 80.68   84.37   80.65   85.25   89.37   89.46      82.35   92.10   77.93    84.68
Test, all lengths
Berkeley                   79.19   70.50   80.38   78.30   86.96   81.62      71.42   79.23   79.18    78.53
Berkeley-Tags              78.66   74.74   79.76   78.28   85.42   85.22      78.56   86.75   80.64    80.89
Crabbé and Seddah (2014)   77.66   85.35   79.68   77.15   86.19   87.51      79.35   91.60   82.72    83.02
Hall et al. (2014)         78.75   83.39   79.70   78.43   87.18   88.25      80.18   90.66   82.00    83.17
This work*                 80.24   85.41   81.25   80.95   88.61   90.66      82.23   92.97   83.45    85.08
Reranked ensemble
2014 Best                  81.32   88.24   82.53   81.66   89.80   91.72      83.81   90.50   85.50    86.12


Table 3: Results for the nine treebanks in the SPMRL 2013/2014 Shared Tasks; all values are F-scores
for sentences of all lengths using the version of evalb distributed with the shared task. Our parser
substantially outperforms the strongest single parser results on this dataset (Hall et al., 2014; Crabbé and
Seddah, 2014). Berkeley-Tags is an improved version of the Berkeley parser designed for the shared task
(Seddah et al., 2013). 2014 Best is a reranked ensemble of modified Berkeley parsers and constitutes the
best published numbers on this dataset (Björkelund et al., 2013; Björkelund et al., 2014).


                                  F1 all
Single model, PTB only
Hall et al. (2014)                89.2
Berkeley                          90.1
Carreras et al. (2008)            91.1
Shindo et al. (2012) single       91.1
Single model, PTB + vectors/clusters
Zhu et al. (2013)                 91.3
This work*                        91.1
Extended conditions
Charniak and Johnson (2005)       91.5
Socher et al. (2013)              90.4
Vinyals et al. (2014) single      90.5
Vinyals et al. (2014) ensemble    91.6
Shindo et al. (2012) ensemble     92.4


Table 4: Test results on section 23 of the Penn Treebank. We compare
to several categories of parsers from the literature. We outperform
strong baselines such as the Berkeley Parser (Petrov and Klein, 2007)
and the CVG Stanford parser (Socher et al., 2013) and we match the
performance of sophisticated generative (Shindo et al., 2012) and
discriminative (Carreras et al., 2008) parsers.

four parsers trained only on the PTB with no auxiliary data: the CRF
parser of Hall et al. (2014), the Berkeley parser (Petrov and Klein,
2007), the discriminative parser of Carreras et al. (2008), and the
single TSG parser of Shindo et al. (2012). To our knowledge, the
latter two systems are the highest performing in this PTB-only, single
parser data condition; we match their performance at 91.1 F1, though
we also use word vectors computed from unlabeled data. We further
compare to the shift-reduce parser of Zhu et al. (2013), which uses
unlabeled data in the form of Brown clusters. Our method achieves
performance close to that of their parser.

We also compare to the compositional vector grammar (CVG) parser of
Socher et al. (2013) as well as the LSTM-based parser of Vinyals et
al. (2014). The conditions these parsers are operating under are
slightly different: the former is a reranker on top of the Stanford
Parser (Klein and Manning, 2003) and the latter trains on much larger
amounts of data parsed by a product of Berkeley parsers (Petrov,
2010). Regardless, we outperform the CVG parser as well as the single
parser results from Vinyals et al. (2014).

7.2 SPMRL 

We also examine the performance of our parser on other languages,
specifically the nine morphologically-rich languages used in the SPMRL
2013/2014 shared tasks (Seddah et al., 2013; Seddah et al., 2014). We
train word vectors on the monolingual data distributed with the SPMRL
2014 shared task (typically 100M-200M tokens per language) using the
skip-gram approach of word2vec with a window size of 1 (Mikolov et
al., 2013). Here we use $V = 1$ in the backbone grammar, which we
found to be beneficial overall. Table 3 shows that our system improves
upon the performance of the parser from Hall et al. (2014) as well as
the top single parser from the shared task (Crabbé and Seddah, 2014),
with robust improvements on all languages.

8 Conclusion 

In this work, we presented a CRF parser that scores anchored rule
productions using dense input features computed from a feedforward
neural net. Because the neural component is modularized, we can easily
integrate it into a pre-existing learning and inference framework
based around dynamic programming of a discrete parse chart. Our
combined neural and sparse model gives strong performance both on
English and on other languages.

Our system is publicly available at http://nlp.cs.berkeley.edu.